1.A. Refer above table and find the joint probability of the people who planned to purchase and actually placed an order.
print('The joint probability of the people who planned to purchase and actually placed an order:',400/2000)
The joint probability of the people who planned to purchase and actually placed an order: 0.2
1.B find the joint probability of the people who planned to purchase and actually placed an order, given that people planned to purchase
print('joint probability of the people who planned to purchase and actually placed an order, given that people planned to purchase:',400/500)
joint probability of the people who planned to purchase and actually placed an order, given that people planned to purchase: 0.8
An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.
Decalaring the sample size n = 10
p, probabilty of success(failure rate) = 0.05
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
from scipy.stats import binom,poisson
from datetime import date,datetime
import plotly.express as px
n = 10
p = 0.05
k = np.arange(0,11)
binomial = binom.pmf(k=k,n=n,p=p)
binomial
array([5.98736939e-01, 3.15124705e-01, 7.46347985e-02, 1.04750594e-02,
9.64808106e-04, 6.09352488e-05, 2.67259863e-06, 8.03789063e-08,
1.58642578e-09, 1.85546875e-11, 9.76562500e-14])
print('Probability that none of the items are defective:',round(binomial[0],2))
Probability that none of the items are defective: 0.6
print(' Probability that exactly one of the items is defective:',round(binomial[1],2))
Probability that exactly one of the items is defective: 0.32
print('Probability that two or fewer of the items are defective',round(binomial[0]+binomial[1]+binomial[2],3))
Probability that two or fewer of the items are defective 0.988
print('Probability that three or more of the items are defective',round(1 - (binomial[0]+binomial[1]+binomial[2]),3))
Probability that three or more of the items are defective 0.012
A car salesman sells on an average 3 cars per week.
3A. What is Probability that in a given week he will sell some cars?
Here sell some cars means 1 or more than 1 cars
# lambda (rate) is 3
rate = 3
n = np.arange(0,15)
poisson = stats.poisson.pmf(n,rate)
poisson
array([4.97870684e-02, 1.49361205e-01, 2.24041808e-01, 2.24041808e-01,
1.68031356e-01, 1.00818813e-01, 5.04094067e-02, 2.16040315e-02,
8.10151179e-03, 2.70050393e-03, 8.10151179e-04, 2.20950322e-04,
5.52375804e-05, 1.27471339e-05, 2.73152870e-06])
# 1 or more means 1 - probabilty of zero cars
x = 1 - poisson[0]
print("Probability that in a given week he will sell some cars:",x)
print()
Probability that in a given week he will sell some cars: 0.950212931632136
3.B. What is Probability that in a given week he will sell 2 or more but less than 5 cars?
x = poisson[2] + poisson[3] + poisson[4]
print('Probability that in a given week he will sell 2 or more but less than 5 cars:',x)
print()
Probability that in a given week he will sell 2 or more but less than 5 cars: 0.6161149710523164
3.C. Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per week?
plt.plot(n,poisson,color='green', marker='o', linestyle='dashed',linewidth=2, markersize=5)
plt.show()
Accuracy in understanding orders for a speech based bot at a restaurant is important for the Company X which has designed,
marketed and launched the product for a contactless delivery due to the COVID-19 pandemic. Recognition accuracy that measures the
percentage of orders that are taken correctly is 86.8%. Suppose that you place an order with the bot and two friends of yours
independently place orders with the same bot.
Answer the following questions.
Let P_A be probabilty of correctly placed orders = 0.868
Let P_B be probabilty of not correctly placed orders = 1 - P_A
4.A. What is the probability that all three orders will be recognised correctly?
P_A = 0.868
P_B = 1 - P_A
X = P_A*P_A*P_A
print("The probability that all three orders will be recognised correctly:",np.round(X,2))
The probability that all three orders will be recognised correctly: 0.65
4.B. What is the probability that none of the three orders will be recognised correctly?
X = P_B * P_B * P_B
X
0.0022999680000000003
print("The probability that none of the three orders will be recognised correctly:",X)
The probability that none of the three orders will be recognised correctly: 0.0022999680000000003
4.C. What is the probability that at least two of the three orders will be recognised correctly? [
X = P_A*P_A*P_B
A = P_A*P_B*P_A
B = P_B*P_A*P_A
Y = P_A*P_A*P_A
print(' the probability that at least two of the three orders will be recognised correctly',X+Y+A+B)
the probability that at least two of the three orders will be recognised correctly 0.952327936
Explain 1 real life industry scenario (other than the ones mentioned above) where you can use the concepts learnt in this module of Applied Statistics to get data driven business solution.
If the frequency of a certain health condition in the population is 10%, what is the probability that among 10 patients a doctor will see no more than 2 patients with that condition? Whether or not a random person has the condition is like a coin flip, they either do or do not, with probability of having the condition ("success")
P = 0.1. Among n = 10 patients, the probability of 3 or fewer having the condition can be found using the cumulative distribution function
Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.
Basketball.csv - The data set contains information on all the teams so far participated in all the past tournaments.
Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.
Basketball.csv - The data set contains information on all the teams so far participated in all the past tournaments.
Company’s management wants to invest on proposals on managing some of the best teams in the league. The analytics department has been assigned with a task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams they can approach which will be a deal win for them
df = pd.read_csv('BasketBall.csv')
df.head()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931to32 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
df.isna().sum()
Team 0 Tournament 0 Score 0 PlayedGames 0 WonGames 0 DrawnGames 0 LostGames 0 BasketScored 0 BasketGiven 0 TournamentChampion 0 Runner-up 0 TeamLaunch 0 HighestPositionHeld 0 dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null object 3 PlayedGames 61 non-null object 4 WonGames 61 non-null object 5 DrawnGames 61 non-null object 6 LostGames 61 non-null object 7 BasketScored 61 non-null object 8 BasketGiven 61 non-null object 9 TournamentChampion 61 non-null object 10 Runner-up 61 non-null object 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(2), object(11) memory usage: 6.3+ KB
df1 = df.copy()
# checking data types and null values if present
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null object 3 PlayedGames 61 non-null object 4 WonGames 61 non-null object 5 DrawnGames 61 non-null object 6 LostGames 61 non-null object 7 BasketScored 61 non-null object 8 BasketGiven 61 non-null object 9 TournamentChampion 61 non-null object 10 Runner-up 61 non-null object 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(2), object(11) memory usage: 6.3+ KB
df1.columns
Index(['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
# checking unique values in each column
for i in df1.columns:
print('Unique values in {}'.format(i))
print(df1[i].unique())
print()
Unique values in Team ['Team 1' 'Team 2' 'Team 3' 'Team 4' 'Team 5' 'Team 6' 'Team 7' 'Team 8' 'Team 9' 'Team 10' 'Team 11' 'Team 12' 'Team 13' 'Team 14' 'Team 15' 'Team 16' 'Team 17' 'Team 18' 'Team 19' 'Team 20' 'Team 21' 'Team 22' 'Team 23' 'Team 24' 'Team 25' 'Team 26' 'Team 27' 'Team 28' 'Team 29' 'Team 30' 'Team 31' 'Team 32' 'Team 33' 'Team 34' 'Team 35' 'Team 36' 'Team 37' 'Team 38' 'Team 39' 'Team 40' 'Team 41' 'Team 42' 'Team 43' 'Team 44' 'Team 45' 'Team 46' 'Team 47' 'Team 48' 'Team 49' 'Team 50' 'Team 51' 'Team 52' 'Team 53' 'Team 54' 'Team 55' 'Team 56' 'Team 57' 'Team 58' 'Team 59' 'Team 60' 'Team 61'] Unique values in Tournament [86 80 82 73 70 58 51 45 42 44 43 37 36 38 27 33 17 23 21 12 20 13 18 11 14 9 7 6 4 5 3 2 1] Unique values in Score ['4385' '4262' '3442' '3386' '3368' '2819' '2792' '2573' '2109' '1884' '1814' '1789' '1471' '1416' '1389' '1351' '1314' '1174' '1148' '1020' '970' '667' '662' '606' '553' '538' '510' '445' '421' '416' '375' '353' '343' '293' '285' '277' '242' '230' '190' '188' '168' '150' '148' '132' '107' '96' '91' '83' '81' '76' '71' '56' '52' '42' '40' '35' '34' '22' '19' '14' '-'] Unique values in PlayedGames ['2762' '2614' '2664' '2408' '2626' '2302' '1986' '1728' '1530' '1698' '1466' '1428' '1458' '1318' '1255' '1192' '988' '1096' '646' '742' '652' '678' '456' '628' '494' '586' '380' '402' '423' '426' '448' '346' '334' '270' '228' '282' '160' '186' '204' '180' '152' '114' '130' '116' '80' '108' '90' '72' '68' '54' '38' '30' '-'] Unique values in WonGames ['1647' '1581' '1241' '1187' '1209' '990' '948' '864' '698' '606' '563' '586' '463' '453' '471' '426' '390' '408' '333' '367' '266' '218' '189' '203' '147' '184' '155' '145' '125' '113' '123' '129' '104' '96' '103' '76' '62' '82' '52' '50' '59' '53' '37' '35' '43' '26' '34' '20' '19' '30' '29' '21' '17' '18' '13' '8' '7' '5' '-'] Unique values in DrawnGames ['552' '573' '598' '616' '633' '531' '608' '577' '522' '440' '392' '389' '384' '336' '358' '327' '330' '292' '256' '242' '172' '175' '148' '180' '112' '149' '128' '143' '81' '95' '102' '127' '92' '79' '76' '56' '63' '45' '46' '50' '44' '37' '27' '21' '16' '23' '24' '13' '14' '18' '6' '11' '10' '8' '5' '4' '-'] Unique values in LostGames ['563' '608' '775' '861' '920' '887' '1070' '766' '682' '575' '723' '619' '639' '629' '565' '535' '492' '399' '487' '208' '349' '305' '295' '197' '211' '298' '174' '194' '198' '202' '217' '158' '152' '118' '110' '137' '63' '90' '95' '83' '78' '52' '66' '44' '37' '62' '48' '33' '30' '41' '19' '20' '15' '18' '21' '-'] Unique values in BasketScored ['5947' '5900' '4534' '4398' '4631' '3680' '3609' '3228' '2683' '2159' '2052' '2278' '1767' '1843' '1753' '1500' '1421' '1642' '1182' '1347' '892' '819' '760' '750' '520' '716' '619' '607' '458' '430' '422' '492' '393' '291' '419' '320' '244' '285' '199' '202' '216' '165' '155' '139' '227' '101' '181' '62' '70' '145' '121' '153' '71' '97' '36' '38' '37' '51' '34' '-'] Unique values in BasketGiven ['3140' '3114' '3309' '3469' '3700' '3373' '3889' '3230' '2847' '2492' '2188' '2624' '2180' '2368' '2152' '1834' '1763' '1951' '1371' '1746' '789' '1157' '1088' '1022' '633' '1050' '744' '992' '623' '632' '581' '720' '662' '489' '588' '410' '366' '430' '241' '296' '310' '221' '253' '167' '308' '139' '295' '117' '115' '252' '183' '184' '116' '131' '182' '55' '66' '57' '85' '65' '-'] Unique values in TournamentChampion ['33' '25' '10' '6' '8' '1' '-' '2'] Unique values in Runner-up ['23' '25' '8' '6' '7' '4' '-' '3' '1' '5'] Unique values in TeamLaunch ['1929' '1931to32' '1934-35' '1939-40' '1932-33' '1941to42' '1948-49' '1944_45' '1935-36' '1949_50' '1933to34' '1960-61' '1951-52' '1998-99' '1941-42' '1977-78' '1959-60' '2004to05' '1961-62' '1940-41' '1930-31' '1963-64' '1974-75' '1943-44' '1987-88' '1991_92' '2007-08' '1962-63' '1994-95' '1978-79' '1971-72' '1999to00' '2014-15' '1990-91' '1947-48' '1996-97' '1995-96' '1945-46' '1953-54' '1979-80' '1950-51' '2016_17' '2009-10' '1956-57' '1951~52' '1955-56' '2017~18'] Unique values in HighestPositionHeld [ 1 3 2 4 6 8 5 11 7 12 10 17 9 19 14 16 20 15]
We found '-' value in some of the columns and it should be treated accordingly
df1['TournamentChampion'].value_counts()
- 52 1 3 10 1 25 1 6 1 2 1 33 1 8 1 Name: TournamentChampion, dtype: int64
# df1.query('TournamentChampion == "-"')[['Team','TournamentChampion']]
df1.query('Score == "-"')[['Team','Score']]
| Team | Score | |
|---|---|---|
| 60 | Team 61 | - |
# Replaceing "-" values with 0
df1.replace('-','0',inplace = True)
# Checking again each column with unique values and 0 is imputed against "-"
for i in df1.columns:
print('Unique values in {}'.format(i))
print(df1[i].unique())
print()
Unique values in Team ['Team 1' 'Team 2' 'Team 3' 'Team 4' 'Team 5' 'Team 6' 'Team 7' 'Team 8' 'Team 9' 'Team 10' 'Team 11' 'Team 12' 'Team 13' 'Team 14' 'Team 15' 'Team 16' 'Team 17' 'Team 18' 'Team 19' 'Team 20' 'Team 21' 'Team 22' 'Team 23' 'Team 24' 'Team 25' 'Team 26' 'Team 27' 'Team 28' 'Team 29' 'Team 30' 'Team 31' 'Team 32' 'Team 33' 'Team 34' 'Team 35' 'Team 36' 'Team 37' 'Team 38' 'Team 39' 'Team 40' 'Team 41' 'Team 42' 'Team 43' 'Team 44' 'Team 45' 'Team 46' 'Team 47' 'Team 48' 'Team 49' 'Team 50' 'Team 51' 'Team 52' 'Team 53' 'Team 54' 'Team 55' 'Team 56' 'Team 57' 'Team 58' 'Team 59' 'Team 60' 'Team 61'] Unique values in Tournament [86 80 82 73 70 58 51 45 42 44 43 37 36 38 27 33 17 23 21 12 20 13 18 11 14 9 7 6 4 5 3 2 1] Unique values in Score ['4385' '4262' '3442' '3386' '3368' '2819' '2792' '2573' '2109' '1884' '1814' '1789' '1471' '1416' '1389' '1351' '1314' '1174' '1148' '1020' '970' '667' '662' '606' '553' '538' '510' '445' '421' '416' '375' '353' '343' '293' '285' '277' '242' '230' '190' '188' '168' '150' '148' '132' '107' '96' '91' '83' '81' '76' '71' '56' '52' '42' '40' '35' '34' '22' '19' '14' '0'] Unique values in PlayedGames ['2762' '2614' '2664' '2408' '2626' '2302' '1986' '1728' '1530' '1698' '1466' '1428' '1458' '1318' '1255' '1192' '988' '1096' '646' '742' '652' '678' '456' '628' '494' '586' '380' '402' '423' '426' '448' '346' '334' '270' '228' '282' '160' '186' '204' '180' '152' '114' '130' '116' '80' '108' '90' '72' '68' '54' '38' '30' '0'] Unique values in WonGames ['1647' '1581' '1241' '1187' '1209' '990' '948' '864' '698' '606' '563' '586' '463' '453' '471' '426' '390' '408' '333' '367' '266' '218' '189' '203' '147' '184' '155' '145' '125' '113' '123' '129' '104' '96' '103' '76' '62' '82' '52' '50' '59' '53' '37' '35' '43' '26' '34' '20' '19' '30' '29' '21' '17' '18' '13' '8' '7' '5' '0'] Unique values in DrawnGames ['552' '573' '598' '616' '633' '531' '608' '577' '522' '440' '392' '389' '384' '336' '358' '327' '330' '292' '256' '242' '172' '175' '148' '180' '112' '149' '128' '143' '81' '95' '102' '127' '92' '79' '76' '56' '63' '45' '46' '50' '44' '37' '27' '21' '16' '23' '24' '13' '14' '18' '6' '11' '10' '8' '5' '4' '0'] Unique values in LostGames ['563' '608' '775' '861' '920' '887' '1070' '766' '682' '575' '723' '619' '639' '629' '565' '535' '492' '399' '487' '208' '349' '305' '295' '197' '211' '298' '174' '194' '198' '202' '217' '158' '152' '118' '110' '137' '63' '90' '95' '83' '78' '52' '66' '44' '37' '62' '48' '33' '30' '41' '19' '20' '15' '18' '21' '0'] Unique values in BasketScored ['5947' '5900' '4534' '4398' '4631' '3680' '3609' '3228' '2683' '2159' '2052' '2278' '1767' '1843' '1753' '1500' '1421' '1642' '1182' '1347' '892' '819' '760' '750' '520' '716' '619' '607' '458' '430' '422' '492' '393' '291' '419' '320' '244' '285' '199' '202' '216' '165' '155' '139' '227' '101' '181' '62' '70' '145' '121' '153' '71' '97' '36' '38' '37' '51' '34' '0'] Unique values in BasketGiven ['3140' '3114' '3309' '3469' '3700' '3373' '3889' '3230' '2847' '2492' '2188' '2624' '2180' '2368' '2152' '1834' '1763' '1951' '1371' '1746' '789' '1157' '1088' '1022' '633' '1050' '744' '992' '623' '632' '581' '720' '662' '489' '588' '410' '366' '430' '241' '296' '310' '221' '253' '167' '308' '139' '295' '117' '115' '252' '183' '184' '116' '131' '182' '55' '66' '57' '85' '65' '0'] Unique values in TournamentChampion ['33' '25' '10' '6' '8' '1' '0' '2'] Unique values in Runner-up ['23' '25' '8' '6' '7' '4' '0' '3' '1' '5'] Unique values in TeamLaunch ['1929' '1931to32' '1934-35' '1939-40' '1932-33' '1941to42' '1948-49' '1944_45' '1935-36' '1949_50' '1933to34' '1960-61' '1951-52' '1998-99' '1941-42' '1977-78' '1959-60' '2004to05' '1961-62' '1940-41' '1930-31' '1963-64' '1974-75' '1943-44' '1987-88' '1991_92' '2007-08' '1962-63' '1994-95' '1978-79' '1971-72' '1999to00' '2014-15' '1990-91' '1947-48' '1996-97' '1995-96' '1945-46' '1953-54' '1979-80' '1950-51' '2016_17' '2009-10' '1956-57' '1951~52' '1955-56' '2017~18'] Unique values in HighestPositionHeld [ 1 3 2 4 6 8 5 11 7 12 10 17 9 19 14 16 20 15]
We have removed the "-" from the data and imputed with 0
# [df1[i].astype(int) for i in df1.columns[1:11]]
df1.isna().sum()
Team 0 Tournament 0 Score 0 PlayedGames 0 WonGames 0 DrawnGames 0 LostGames 0 BasketScored 0 BasketGiven 0 TournamentChampion 0 Runner-up 0 TeamLaunch 0 HighestPositionHeld 0 dtype: int64
# COnverting the datatypes into int
for i in df1.columns[1:]:
if i != 'TeamLaunch':
df1[i] = df1[i].astype(int)
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null int64 3 PlayedGames 61 non-null int64 4 WonGames 61 non-null int64 5 DrawnGames 61 non-null int64 6 LostGames 61 non-null int64 7 BasketScored 61 non-null int64 8 BasketGiven 61 non-null int64 9 TournamentChampion 61 non-null int64 10 Runner-up 61 non-null int64 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(11), object(2) memory usage: 6.3+ KB
df1['TeamLaunch'].unique()
array(['1929', '1931to32', '1934-35', '1939-40', '1932-33', '1941to42',
'1948-49', '1944_45', '1935-36', '1949_50', '1933to34', '1960-61',
'1951-52', '1998-99', '1941-42', '1977-78', '1959-60', '2004to05',
'1961-62', '1940-41', '1930-31', '1963-64', '1974-75', '1943-44',
'1987-88', '1991_92', '2007-08', '1962-63', '1994-95', '1978-79',
'1971-72', '1999to00', '2014-15', '1990-91', '1947-48', '1996-97',
'1995-96', '1945-46', '1953-54', '1979-80', '1950-51', '2016_17',
'2009-10', '1956-57', '1951~52', '1955-56', '2017~18'],
dtype=object)
We found that Team Launch is not correct with values and will format the values with initial year as launchYear using below function
# this function will take the initial year from Teamlaunch
def clean(x):
if 'to' in x:
return x.split(sep = 'to')[0]
elif '-' in x:
return x.split(sep = '-')[0]
elif '_' in x:
return x.split(sep = '_')[0]
elif '~' in x:
return x.split(sep = '~')[0]
else:
return x
df1['TeamLaunch'] = df1['TeamLaunch'].apply(clean)
df1.columns
Index(['Team', 'Tournament', 'Score', 'PlayedGames', 'WonGames', 'DrawnGames',
'LostGames', 'BasketScored', 'BasketGiven', 'TournamentChampion',
'Runner-up', 'TeamLaunch', 'HighestPositionHeld'],
dtype='object')
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null int64 3 PlayedGames 61 non-null int64 4 WonGames 61 non-null int64 5 DrawnGames 61 non-null int64 6 LostGames 61 non-null int64 7 BasketScored 61 non-null int64 8 BasketGiven 61 non-null int64 9 TournamentChampion 61 non-null int64 10 Runner-up 61 non-null int64 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(11), object(2) memory usage: 6.3+ KB
df1.head()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 |
# Again checking gor unique values in the column
for i in df1.columns:
print('Unique values in {}'.format(i))
print(df1[i].unique())
print()
Unique values in Team
['Team 1' 'Team 2' 'Team 3' 'Team 4' 'Team 5' 'Team 6' 'Team 7' 'Team 8'
'Team 9' 'Team 10' 'Team 11' 'Team 12' 'Team 13' 'Team 14' 'Team 15'
'Team 16' 'Team 17' 'Team 18' 'Team 19' 'Team 20' 'Team 21' 'Team 22'
'Team 23' 'Team 24' 'Team 25' 'Team 26' 'Team 27' 'Team 28' 'Team 29'
'Team 30' 'Team 31' 'Team 32' 'Team 33' 'Team 34' 'Team 35' 'Team 36'
'Team 37' 'Team 38' 'Team 39' 'Team 40' 'Team 41' 'Team 42' 'Team 43'
'Team 44' 'Team 45' 'Team 46' 'Team 47' 'Team 48' 'Team 49' 'Team 50'
'Team 51' 'Team 52' 'Team 53' 'Team 54' 'Team 55' 'Team 56' 'Team 57'
'Team 58' 'Team 59' 'Team 60' 'Team 61']
Unique values in Tournament
[86 80 82 73 70 58 51 45 42 44 43 37 36 38 27 33 17 23 21 12 20 13 18 11
14 9 7 6 4 5 3 2 1]
Unique values in Score
[4385 4262 3442 3386 3368 2819 2792 2573 2109 1884 1814 1789 1471 1416
1389 1351 1314 1174 1148 1020 970 667 662 606 553 538 510 445
421 416 375 353 343 293 285 277 242 230 190 188 168 150
148 132 107 96 91 83 81 76 71 56 52 42 40 35
34 22 19 14 0]
Unique values in PlayedGames
[2762 2614 2664 2408 2626 2302 1986 1728 1530 1698 1466 1428 1458 1318
1255 1192 988 1096 646 742 652 678 456 628 494 586 380 402
423 426 448 346 334 270 228 282 160 186 204 180 152 114
130 116 80 108 90 72 68 54 38 30 0]
Unique values in WonGames
[1647 1581 1241 1187 1209 990 948 864 698 606 563 586 463 453
471 426 390 408 333 367 266 218 189 203 147 184 155 145
125 113 123 129 104 96 103 76 62 82 52 50 59 53
37 35 43 26 34 20 19 30 29 21 17 18 13 8
7 5 0]
Unique values in DrawnGames
[552 573 598 616 633 531 608 577 522 440 392 389 384 336 358 327 330 292
256 242 172 175 148 180 112 149 128 143 81 95 102 127 92 79 76 56
63 45 46 50 44 37 27 21 16 23 24 13 14 18 6 11 10 8
5 4 0]
Unique values in LostGames
[ 563 608 775 861 920 887 1070 766 682 575 723 619 639 629
565 535 492 399 487 208 349 305 295 197 211 298 174 194
198 202 217 158 152 118 110 137 63 90 95 83 78 52
66 44 37 62 48 33 30 41 19 20 15 18 21 0]
Unique values in BasketScored
[5947 5900 4534 4398 4631 3680 3609 3228 2683 2159 2052 2278 1767 1843
1753 1500 1421 1642 1182 1347 892 819 760 750 520 716 619 607
458 430 422 492 393 291 419 320 244 285 199 202 216 165
155 139 227 101 181 62 70 145 121 153 71 97 36 38
37 51 34 0]
Unique values in BasketGiven
[3140 3114 3309 3469 3700 3373 3889 3230 2847 2492 2188 2624 2180 2368
2152 1834 1763 1951 1371 1746 789 1157 1088 1022 633 1050 744 992
623 632 581 720 662 489 588 410 366 430 241 296 310 221
253 167 308 139 295 117 115 252 183 184 116 131 182 55
66 57 85 65 0]
Unique values in TournamentChampion
[33 25 10 6 8 1 0 2]
Unique values in Runner-up
[23 25 8 6 7 4 0 3 1 5]
Unique values in TeamLaunch
['1929' '1931' '1934' '1939' '1932' '1941' '1948' '1944' '1935' '1949'
'1933' '1960' '1951' '1998' '1977' '1959' '2004' '1961' '1940' '1930'
'1963' '1974' '1943' '1987' '1991' '2007' '1962' '1994' '1978' '1971'
'1999' '2014' '1990' '1947' '1996' '1995' '1945' '1953' '1979' '1950'
'2016' '2009' '1956' '1955' '2017']
Unique values in HighestPositionHeld
[ 1 3 2 4 6 8 5 11 7 12 10 17 9 19 14 16 20 15]
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null int64 3 PlayedGames 61 non-null int64 4 WonGames 61 non-null int64 5 DrawnGames 61 non-null int64 6 LostGames 61 non-null int64 7 BasketScored 61 non-null int64 8 BasketGiven 61 non-null int64 9 TournamentChampion 61 non-null int64 10 Runner-up 61 non-null int64 11 TeamLaunch 61 non-null object 12 HighestPositionHeld 61 non-null int64 dtypes: int64(11), object(2) memory usage: 6.3+ KB
datetime.today()
datetime.datetime(2022, 7, 10, 17, 17, 6, 102451)
from datetime import date,datetime
date.today()
datetime.date(2022, 7, 10)
current_year = datetime.today().year
datetime.now()
datetime.datetime(2022, 7, 10, 17, 17, 6, 123505)
df1['TeamLaunch'] = df1['TeamLaunch'].astype(int)
# Calculating Age of team
df1['Age'] = current_year - df1['TeamLaunch']
df1['Age'].value_counts()
93 10 81 3 83 2 71 2 59 2 45 2 87 2 35 1 32 1 44 1 28 1 69 1 48 1 13 1 61 1 31 1 51 1 66 1 72 1 43 1 6 1 60 1 24 1 62 1 15 1 92 1 5 1 78 1 82 1 23 1 8 1 73 1 74 1 75 1 77 1 79 1 67 1 18 1 27 1 26 1 88 1 89 1 90 1 91 1 63 1 Name: Age, dtype: int64
df1['TournamentChampion'].value_counts()
0 52 1 3 2 1 6 1 8 1 10 1 25 1 33 1 Name: TournamentChampion, dtype: int64
df1.query('TournamentChampion >=1').sort_values(by = 'TournamentChampion', ascending = False)
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 93 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 93 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | 93 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | 93 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | 91 |
| 7 | Team 8 | 70 | 2573 | 2302 | 864 | 577 | 861 | 3228 | 3230 | 2 | 3 | 1929 | 1 | 93 |
| 5 | Team 6 | 73 | 2819 | 2408 | 990 | 531 | 887 | 3680 | 3373 | 1 | 4 | 1934 | 1 | 88 |
| 9 | Team 10 | 51 | 1884 | 1728 | 606 | 440 | 682 | 2159 | 2492 | 1 | 0 | 1932 | 1 | 90 |
| 10 | Team 11 | 45 | 1814 | 1530 | 563 | 392 | 575 | 2052 | 2188 | 1 | 5 | 1941 | 1 | 81 |
df1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 61 entries, 0 to 60 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Team 61 non-null object 1 Tournament 61 non-null int64 2 Score 61 non-null int64 3 PlayedGames 61 non-null int64 4 WonGames 61 non-null int64 5 DrawnGames 61 non-null int64 6 LostGames 61 non-null int64 7 BasketScored 61 non-null int64 8 BasketGiven 61 non-null int64 9 TournamentChampion 61 non-null int64 10 Runner-up 61 non-null int64 11 TeamLaunch 61 non-null int64 12 HighestPositionHeld 61 non-null int64 13 Age 61 non-null int64 dtypes: int64(13), object(1) memory usage: 6.8+ KB
df1.head()
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | Age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 93 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 93 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | 93 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | 91 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | 93 |
Creating New columns for getting insight as
df1['winn_pct'] = np.round((df1['WonGames'] / df1['PlayedGames']) * 100)
df1['lose_pct'] = np.round((df1['LostGames'] / df1['PlayedGames']) * 100)
df1['Draw_pct'] = np.round((df1['DrawnGames'] / df1['PlayedGames']) * 100)
df1['Tour_cham_prct'] = np.round((df1['TournamentChampion'] / df1['Tournament']) * 100)
df1['runn_up_pct'] = np.round((df1['Runner-up'] / df1['Tournament']) * 100)
df1['Avg_basket_scored_pergame'] = ((df1['BasketScored'] / df1['PlayedGames']))
df1.head(20)
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | Age | winn_pct | lose_pct | Draw_pct | Tour_cham_prct | runn_up_pct | Avg_basket_scored_pergame | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 93 | 60.0 | 20.0 | 20.0 | 38.0 | 27.0 | 2.153150 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 93 | 57.0 | 22.0 | 21.0 | 29.0 | 29.0 | 2.136133 |
| 2 | Team 3 | 80 | 3442 | 2614 | 1241 | 598 | 775 | 4534 | 3309 | 10 | 8 | 1929 | 1 | 93 | 47.0 | 30.0 | 23.0 | 12.0 | 10.0 | 1.734507 |
| 3 | Team 4 | 82 | 3386 | 2664 | 1187 | 616 | 861 | 4398 | 3469 | 6 | 6 | 1931 | 1 | 91 | 45.0 | 32.0 | 23.0 | 7.0 | 7.0 | 1.650901 |
| 4 | Team 5 | 86 | 3368 | 2762 | 1209 | 633 | 920 | 4631 | 3700 | 8 | 7 | 1929 | 1 | 93 | 44.0 | 33.0 | 23.0 | 9.0 | 8.0 | 1.676684 |
| 5 | Team 6 | 73 | 2819 | 2408 | 990 | 531 | 887 | 3680 | 3373 | 1 | 4 | 1934 | 1 | 88 | 41.0 | 37.0 | 22.0 | 1.0 | 5.0 | 1.528239 |
| 6 | Team 7 | 82 | 2792 | 2626 | 948 | 608 | 1070 | 3609 | 3889 | 0 | 0 | 1929 | 3 | 93 | 36.0 | 41.0 | 23.0 | 0.0 | 0.0 | 1.374334 |
| 7 | Team 8 | 70 | 2573 | 2302 | 864 | 577 | 861 | 3228 | 3230 | 2 | 3 | 1929 | 1 | 93 | 38.0 | 37.0 | 25.0 | 3.0 | 4.0 | 1.402259 |
| 8 | Team 9 | 58 | 2109 | 1986 | 698 | 522 | 766 | 2683 | 2847 | 0 | 1 | 1939 | 2 | 83 | 35.0 | 39.0 | 26.0 | 0.0 | 2.0 | 1.350957 |
| 9 | Team 10 | 51 | 1884 | 1728 | 606 | 440 | 682 | 2159 | 2492 | 1 | 0 | 1932 | 1 | 90 | 35.0 | 39.0 | 25.0 | 2.0 | 0.0 | 1.249421 |
| 10 | Team 11 | 45 | 1814 | 1530 | 563 | 392 | 575 | 2052 | 2188 | 1 | 5 | 1941 | 1 | 81 | 37.0 | 38.0 | 26.0 | 2.0 | 11.0 | 1.341176 |
| 11 | Team 12 | 51 | 1789 | 1698 | 586 | 389 | 723 | 2278 | 2624 | 0 | 0 | 1939 | 4 | 83 | 35.0 | 43.0 | 23.0 | 0.0 | 0.0 | 1.341578 |
| 12 | Team 13 | 42 | 1471 | 1466 | 463 | 384 | 619 | 1767 | 2180 | 0 | 0 | 1948 | 4 | 74 | 32.0 | 42.0 | 26.0 | 0.0 | 0.0 | 1.205321 |
| 13 | Team 14 | 44 | 1416 | 1428 | 453 | 336 | 639 | 1843 | 2368 | 0 | 1 | 1929 | 2 | 93 | 32.0 | 45.0 | 24.0 | 0.0 | 2.0 | 1.290616 |
| 14 | Team 15 | 43 | 1389 | 1458 | 471 | 358 | 629 | 1753 | 2152 | 0 | 1 | 1944 | 2 | 78 | 32.0 | 43.0 | 25.0 | 0.0 | 2.0 | 1.202332 |
| 15 | Team 16 | 37 | 1351 | 1318 | 426 | 327 | 565 | 1500 | 1834 | 0 | 0 | 1935 | 4 | 87 | 32.0 | 43.0 | 25.0 | 0.0 | 0.0 | 1.138088 |
| 16 | Team 17 | 36 | 1314 | 1255 | 390 | 330 | 535 | 1421 | 1763 | 0 | 0 | 1949 | 4 | 73 | 31.0 | 43.0 | 26.0 | 0.0 | 0.0 | 1.132271 |
| 17 | Team 18 | 38 | 1174 | 1192 | 408 | 292 | 492 | 1642 | 1951 | 0 | 0 | 1933 | 3 | 89 | 34.0 | 41.0 | 24.0 | 0.0 | 0.0 | 1.377517 |
| 18 | Team 19 | 27 | 1148 | 988 | 333 | 256 | 399 | 1182 | 1371 | 0 | 0 | 1960 | 3 | 62 | 34.0 | 40.0 | 26.0 | 0.0 | 0.0 | 1.196356 |
| 19 | Team 20 | 33 | 1020 | 1096 | 367 | 242 | 487 | 1347 | 1746 | 0 | 1 | 1951 | 2 | 71 | 33.0 | 44.0 | 22.0 | 0.0 | 3.0 | 1.229015 |
df1.sort_values(by = ['TournamentChampion'],ascending = False)[['Team','TournamentChampion','Age']].head(10)
| Team | TournamentChampion | Age | |
|---|---|---|---|
| 0 | Team 1 | 33 | 93 |
| 1 | Team 2 | 25 | 93 |
| 2 | Team 3 | 10 | 93 |
| 4 | Team 5 | 8 | 93 |
| 3 | Team 4 | 6 | 91 |
| 7 | Team 8 | 2 | 93 |
| 9 | Team 10 | 1 | 90 |
| 10 | Team 11 | 1 | 81 |
| 5 | Team 6 | 1 | 88 |
| 38 | Team 39 | 0 | 28 |
Above data shows the top teams in terms of maximum tournaments won with their age
df1.sort_values(by = ['TournamentChampion'],ascending = False)[['Team','TournamentChampion']].head(10).set_index(['Team']).plot(kind = 'barh')
<AxesSubplot:ylabel='Team'>
top_10_tour_champ = df1.sort_values(by = ['TournamentChampion'],ascending = False)[['Team','TournamentChampion']].head(10)
top_10_tour_champ
| Team | TournamentChampion | |
|---|---|---|
| 0 | Team 1 | 33 |
| 1 | Team 2 | 25 |
| 2 | Team 3 | 10 |
| 4 | Team 5 | 8 |
| 3 | Team 4 | 6 |
| 7 | Team 8 | 2 |
| 9 | Team 10 | 1 |
| 10 | Team 11 | 1 |
| 5 | Team 6 | 1 |
| 38 | Team 39 | 0 |
fig = px.bar(top_10_tour_champ, x='Team', y='TournamentChampion',color='Team',text = 'TournamentChampion')
fig.show()
df1.head(1)
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | Age | winn_pct | lose_pct | Draw_pct | Tour_cham_prct | runn_up_pct | Avg_basket_scored_pergame | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 93 | 60.0 | 20.0 | 20.0 | 38.0 | 27.0 | 2.15315 |
top_10_Score = df1.sort_values(by = ['Score'],ascending = False)[['Team','Score']].head(10)
top_10_Score
| Team | Score | |
|---|---|---|
| 0 | Team 1 | 4385 |
| 1 | Team 2 | 4262 |
| 2 | Team 3 | 3442 |
| 3 | Team 4 | 3386 |
| 4 | Team 5 | 3368 |
| 5 | Team 6 | 2819 |
| 6 | Team 7 | 2792 |
| 7 | Team 8 | 2573 |
| 8 | Team 9 | 2109 |
| 9 | Team 10 | 1884 |
fig = px.bar(top_10_Score, x='Team', y='Score',color='Team',text = 'Score')
fig.show()
top_10_winn_pct = df1.sort_values(by = ['winn_pct'],ascending = False)[['Team','winn_pct']].head(10)
top_10_winn_pct
fig = px.bar(top_10_winn_pct, x='Team', y='winn_pct',color='Team',text = 'winn_pct')
fig.show()
top_10_winn_pct = df1.sort_values(by = ['winn_pct'],ascending = False)[['Team','winn_pct']].head(10)
top_10_winn_pct
fig = px.bar(top_10_winn_pct, x='Team', y='winn_pct',color='Team',text = 'winn_pct')
fig.show()
top_10_Avg_basket_scored_pergame = df1.sort_values(by = ['Avg_basket_scored_pergame'],ascending = False)[['Team','Avg_basket_scored_pergame']].head(10)
top_10_Avg_basket_scored_pergame
fig = px.bar(top_10_Avg_basket_scored_pergame, x='Team', y='Avg_basket_scored_pergame',color='Team',text = 'Avg_basket_scored_pergame')
fig.show()
df1.head(2)
| Team | Tournament | Score | PlayedGames | WonGames | DrawnGames | LostGames | BasketScored | BasketGiven | TournamentChampion | Runner-up | TeamLaunch | HighestPositionHeld | Age | winn_pct | lose_pct | Draw_pct | Tour_cham_prct | runn_up_pct | Avg_basket_scored_pergame | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Team 1 | 86 | 4385 | 2762 | 1647 | 552 | 563 | 5947 | 3140 | 33 | 23 | 1929 | 1 | 93 | 60.0 | 20.0 | 20.0 | 38.0 | 27.0 | 2.153150 |
| 1 | Team 2 | 86 | 4262 | 2762 | 1581 | 573 | 608 | 5900 | 3114 | 25 | 25 | 1929 | 1 | 93 | 57.0 | 22.0 | 21.0 | 29.0 | 29.0 | 2.136133 |
top_10_PlayedGames = df1.sort_values(by = ['PlayedGames'],ascending = False)[['Team','PlayedGames']].head(10)
top_10_PlayedGames
fig = px.bar(top_10_PlayedGames, x='Team', y='PlayedGames',color='Team',text = 'PlayedGames')
fig.show()
top_10_Draw_pct = df1.sort_values(by = ['Draw_pct'],ascending = False)[['Team','Draw_pct']].head(10)
top_10_Draw_pct
fig = px.bar(top_10_Draw_pct, x='Team', y='Draw_pct',color='Team',text = 'Draw_pct')
fig.show()
top_10_BasketScored = df1.sort_values(by = ['BasketScored'],ascending = False)[['Team','BasketScored']].head(10)
top_10_BasketScored
fig = px.bar(top_10_BasketScored, x='Team', y='BasketScored',color='Team',text = 'BasketScored')
fig.show()
Suggestions to the association management on quality, quantity, variety, velocity, veracity etc. on the data points collected by the association to perform a better data analysis in future
Company X is a EU online publisher focusing on the startups industry. The company specifically reports on the business related to technology news, analysis of emerging trends and profiling of new tech businesses and products. Their event i.e. Startup Battlefield is the world’s pre-eminent startup competition. Startup Battlefield features 15-30 top early stage startups pitching top judges in front of a vast live audience,present in person and online
CompanyX_EU.csv - Each row in the dataset is a Start-up company and the columns describe the company
Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below
steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from
the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.weightstats import ztest
# Reading the csv -loading the data
df = pd.read_csv('CompanyX_EU.csv')
# Reading the csv -loading the data
df = pd.read_csv('CompanyX_EU.csv')
# Checking the values to undertsnad data
df.head()
| Startup | Product | Funding | Event | Result | OperatingState | |
|---|---|---|---|---|---|---|
| 0 | 2600Hz | 2600hz.com | NaN | Disrupt SF 2013 | Contestant | Operating |
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed |
| 2 | 3DPrinterOS | 3dprinteros.com | NaN | Disrupt SF 2016 | Contestant | Operating |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating |
| 4 | 42 Technologies | 42technologies.com | NaN | Disrupt NYC 2013 | Contestant | Operating |
# Checking the information regarding datatypes , null values if present
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 662 entries, 0 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 662 non-null object 1 Product 656 non-null object 2 Funding 448 non-null object 3 Event 662 non-null object 4 Result 662 non-null object 5 OperatingState 662 non-null object dtypes: object(6) memory usage: 31.2+ KB
Above information tells us that there are missing values for Funding Column ,Product Column
# Dropping the null values
df1 = df.dropna().copy(deep=True)
df1.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 446 entries, 1 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 446 non-null object 1 Product 446 non-null object 2 Funding 446 non-null object 3 Event 446 non-null object 4 Result 446 non-null object 5 OperatingState 446 non-null object dtypes: object(6) memory usage: 24.4+ KB
# Checking the values for Funding column
df1['Funding'].unique()
array(['$630K', '$1M', '$19.3M', '$29M', '$1.8M', '$750K', '$17.9K',
'$11.6M', '$929.2K', '$12.2M', '$1.9M', '$4M', '$3M', '$225K',
'$24M', '$746K', '$600K', '$7.1M', '$4.5M', '$50.9M', '$8.3M',
'$12.3M', '$190K', '$40M', '$520K', '$2.9M', '$320K', '$1.7M',
'$20K', '$205M', '$16.5M', '$2.8M', '$125K', '$350K', '$20.8M',
'$675K', '$6.5M', '$2M', '$118K', '$20.5M', '$7.2M', '$2.3M',
'$212K', '$7M', '$8.1M', '$8.5M', '$325K', '$150K', '$5M', '$60K',
'$8.8M', '$120K', '$3.1M', '$5.5M', '$2.2M', '$496K', '$1.4M',
'$32.5M', '$19M', '$475K', '$182.1M', '$6.7M', '$848K', '$1.3M',
'$12M', '$2.6M', '$411.5K', '$925K', '$35.4M', '$38M', '$37.1M',
'$6.8M', '$72M', '$64M', '$12.5M', '$900K', '$100K', '$7.3M',
'$983K', '$9.7M', '$5.4M', '$160K', '$300K', '$1.7B', '$10.1M',
'$485K', '$8M', '$4.3M', '$10M', '$34.6M', '$16.3M', '$2.4M',
'$650K', '$35.5M', '$15.4M', '$18.2M', '$30.1M', '$26M', '$1.2M',
'$790.6K', '$66M', '$28.3M', '$6.4M', '$8.2M', '$5K', '$310.5K',
'$11.3M', '$3.8M', '$700K', '$103M', '$334.6K', '$25.9M',
'$755.9K', '$36.5M', '$14.2M', '$500K', '$191.5K', '$44.7M',
'$566.1K', '$6.3M', '$24.2M', '$10.7M', '$22M', '$1.1M', '$3.2M',
'$195K', '$14.1M', '$50K', '$745K', '$457K', '$355K', '$696.3K',
'$122.4M', '$421.6K', '$34.9M', '$67.8M', '$11.7M', '$166.1M',
'$1.6M', '$4.4M', '$717K', '$5.8M', '$4.1M', '$42.1M', '$69M',
'$10.6M', '$65.1M', '$875K', '$250K', '$25.1M', '$635K', '$9.2M',
'$800K', '$9.4M', '$31.8M', '$3.3M', '$9.5M', '$3.5M', '$275K',
'$13.2M', '$12.8M', '$5.3M', '$37M', '$3.9M', '$651.4K', '$38.5M',
'$25.5M', '$892K', '$3.4M', '$1.5M', '$79.5K', '$156K', '$4.8M',
'$755K', '$525K', '$18.5M', '$278M', '$3.6M', '$63M', '$948K',
'$10.5M', '$7.4M', '$6M', '$20.3K', '$765K', '$145K', '$25M',
'$956K', '$7.7M', '$160M', '$590K', '$855.4K', '$165K', '$16.2M',
'$18M', '$440K', '$684.4K', '$129.8K', '$482K', '$200K', '$2.1M',
'$23K', '$863K', '$15.6M', '$47.4M', '$323.8K', '$5.1M', '$24.5M',
'$532K', '$4.2M', '$280K', '$9.8M', '$16M', '$93K', '$88K',
'$10.3M', '$13.1M', '$332.4M', '$7.8M', '$14.5M', '$35M',
'$168.8M', '$830.2K', '$28M', '$26.7K', '$55K', '$30.8M', '$85.7K',
'$2.5M', '$17.4M', '$15.8M', '$28K', '$41.8M', '$142M', '$117.8M',
'$62.1M', '$583.6M', '$223M', '$31.5M'], dtype=object)
Converting ‘Funding’ features to a million attribute and into numerical value
df1.loc[:,'Funds_in_million'] = df1['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K' else (float(x[1:-1])*1000 if x[-1] == 'B' else float(x[1:-1])))
df1.head()
| Startup | Product | Funding | Event | Result | OperatingState | Funds_in_million | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.30 |
| 6 | 8 Securities | 8securities.com | $29M | Disrupt Beijing 2011 | Finalist | Operating | 29.00 |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
Creating a boxplot for funds in million.
sns.boxplot(data = df1, x = 'Funds_in_million')
plt.show()
plt.boxplot(df1['Funds_in_million'])
plt.show()
Checking the number of outliers present in Funds_in_million column using IQR
q1 = df1['Funds_in_million'].quantile(0.25)
q3 = df1['Funds_in_million'].quantile(0.75)
IQR = q3 - q1
upperfence = q3 + IQR*1.5
upperfence
22.569625
out_lier =len(df1[df1['Funds_in_million'] > upperfence])
print('Number of Outliers present :{}'.format(out_lier))
Number of Outliers present :60
lets see the distribution after removing outliers
# Removing the outliers from the dataset
df1.drop(df1[df1.Funds_in_million > upperfence].index, inplace = True)
plot = sns.boxplot(df1['Funds_in_million'])
plt.title('Boxplot of funds without the original outliers')
plt.ylabel("Funds raised (in Million)")
plt.show()
df1.head()
| Startup | Product | Funding | Event | Result | OperatingState | Funds_in_million | |
|---|---|---|---|---|---|---|---|
| 1 | 3DLT | 3dlt.com | $630K | Disrupt NYC 2013 | Contestant | Closed | 0.63 |
| 3 | 3Dprintler | 3dprintler.com | $1M | Disrupt NY 2016 | Audience choice | Operating | 1.00 |
| 5 | 5to1 | 5to1.com | $19.3M | TC50 2009 | Contestant | Acquired | 19.30 |
| 10 | AdhereTech | adheretech.com | $1.8M | Hardware Battlefield 2014 | Contestant | Operating | 1.80 |
| 11 | AdRocket | adrocket.com | $1M | TC50 2008 | Contestant | Closed | 1.00 |
# Check frequency of the OperatingState features classes
print('Frequency of the OperatingState features classes')
df1['OperatingState'].value_counts()
Frequency of the OperatingState features classes
Operating 275 Closed 56 Acquired 55 Name: OperatingState, dtype: int64
sns.distplot(df1['Funds_in_million'])
<AxesSubplot:xlabel='Funds_in_million', ylabel='Density'>
df1['Funds_in_million'].describe()
count 386.00000 mean 3.72514 std 4.73236 min 0.00500 25% 0.60000 50% 1.70000 75% 5.00000 max 22.00000 Name: Funds_in_million, dtype: float64
From above description , It is seen as 75% companies have raised less tha 5 million in funds and data has maximum value of 22 as well
fig, ax = plt.subplots(1, 2)
fig.set_figheight(5)
fig.set_figwidth(15)
sns.distplot(df1.loc[df1['OperatingState'] == 'Operating', 'Funds_in_million'], ax = ax[0])
sns.distplot(df1.loc[df1['OperatingState'] =='Closed', 'Funds_in_million'], ax = ax[1])
ax[0].set_title('Funds raised by the companies still operating')
ax[1].set_title('Funds raised by companies that got closed')
plt.show()
Null hypothesis (Ho) : No Difference between Funds raised by companies that are still operating vs companies that closed down
Alternate hypothesis (Ha) : Difference between Funds raised by companies that are still operating vs companies that closed down
from statsmodels.stats.weightstats import ztest
s1 = df1.loc[df1['OperatingState'] == 'Operating', 'Funds_in_million']
s2 = df1.loc[df1['OperatingState'] =='Closed', 'Funds_in_million']
alpha = 0.05 # Let's consider a significance level of 5%
testStatistic, pValue = ztest(s1, s2)
print('Pvalue : ',np.round(pValue,3))
print('alpha : ',np.round(alpha,3))
Pvalue : 0.192 alpha : 0.05
Copy of the original data frame
df2 = df.copy(deep=True)
df2.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 662 entries, 0 to 661 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Startup 662 non-null object 1 Product 656 non-null object 2 Funding 448 non-null object 3 Event 662 non-null object 4 Result 662 non-null object 5 OperatingState 662 non-null object dtypes: object(6) memory usage: 31.2+ KB
Frequency distribution of Result variables
df2['Result'].value_counts()
Contestant 488 Finalist 84 Audience choice 41 Winner 26 Runner up 23 Name: Result, dtype: int64
df2['Result'].value_counts(normalize=True)*100
Contestant 73.716012 Finalist 12.688822 Audience choice 6.193353 Winner 3.927492 Runner up 3.474320 Name: Result, dtype: float64
# Filtering for Operating class only
df2_operating = df2.query('OperatingState == "Operating"')
df2_operating['Result'].value_counts()
Contestant 332 Finalist 61 Audience choice 32 Runner up 21 Winner 19 Name: Result, dtype: int64
There are 332 Contestant that are operating and 19 winner that are operating, So we can combine others choices into winners
winners_operating = df2_operating['Result'].value_counts()[1:].sum()
Contestant_operating = df2_operating['Result'].value_counts()['Contestant']
winners = df2['Result'].value_counts()[1:].sum()
Contestant = df2['Result'].value_counts()['Contestant']
Percentage of winners that are still operating and percentage of contestants that are still operating
winners_operating/winners ,Contestant_operating/Contestant
(0.764367816091954, 0.680327868852459)
Null hyputhesis (Ho): The proportion of companies that are operating is the same for - WINNERS and CONTESTANTS
Alternative hypothesis (Ha): The proportion of companies that are operating is significantly different from each other, between WINNERS and CONTESTANTS
from statsmodels.stats.proportion import proportions_ztest
teststatistic, pvalue = proportions_ztest([Contestant_operating, winners_operating], [Contestant, winners])
print('pvalue :',round(pvalue,3))
print('alpha :',alpha)
pvalue : 0.037 alpha : 0.05
Since the p-value, 0.037 < 0.05 (alpha) the difference is significant and we reject the Null hypothesis
and can say that
The proportion of companies that are operating is significantly different from each other, between WINNERS and CONTESTANTS
Select only the Event that has ‘disrupt’ keyword from 2013 onwards
Events = df1[df1['Event'].apply(lambda x: 'Disrupt' in x and int(x[-4:]) > 2012)]['Event'] # all the events of relevance
Events
1 Disrupt NYC 2013
3 Disrupt NY 2016
13 Disrupt SF 2015
14 Disrupt London 2016
16 Disrupt SF 2015
...
635 Disrupt NY 2015
641 Disrupt NYC 2013
642 Disrupt SF 2014
646 Disrupt London 2015
660 Disrupt SF 2013
Name: Event, Length: 185, dtype: object